

## Processor Pipeline

Instructor: Nima Honarmand



## Processor Datapath



#### The Generic Instruction Cycle

- Steps in processing an instruction:
  - Instruction Fetch (IF\_STEP)
  - Instruction Decode (ID\_STEP)
  - Operand Fetch (**OF\_STEP**)
    - Might be from registers or memory
  - Execute (EX\_STEP)
    - Perform computation on the operands
  - Result Store or Write Back (RS\_STEP)
    - Write the execution results back to registers or memory
- ISA determines what needs to be done in each step for each instruction
- μArch determines how HW implements the steps



#### Example: MIPS Instruction Set

All instructions are 32 bits





## A Simple MIPS Datapath





#### Datapath vs. Control Logic

- Datapath is the collection of HW components and their connection in a processor
  - Determines the static structure of processor
- *Control logic* determines the dynamic flow of data between the components
  - E.g., the control lines of MUXes and ALU in last slide
  - Is a function of?
    - Instruction words
    - State of the processor
    - Execution results at each stage



#### Single-Instruction Datapath



- Process one instruction at a time
- Single-cycle control: hardwired
  - Low CPI (1)
  - Long clock period (to accommodate slowest instruction)
- Multi-cycle control: typically micro-programmed
  - Short clock period
  - High CPI
- Can we have both low CPI and short clock period?
  - Not if datapath executes only one instruction at a time
  - No good way to make a single instruction go faster



#### Pipelined Datapath



- Start with multi-cycle design
- When insn0 goes from stage 1 to stage 2
   ... insn1 starts stage 1
- Each instruction passes through all stages ... but instructions enter and leave at faster *rate*

| Style        | Ideal CPI | Cycle Time (1/freq) |
|--------------|-----------|---------------------|
| Single-cycle | 1         | Long                |
| Multi-cycle  | > 1       | Short               |
| Pipelined    | 1         | Short               |



#### Pipeline Illustrated



Pipeline Latency = n Gate Delay + (p-1) register delays p: # of stages

Improves throughput at the expense of latency



# 5-Stage MIPS Pipeline



#### Stage 1: Fetch

- Fetch an instruction from instruction cache every cycle
  - Use PC to index instruction cache
  - Increment PC (assume no branches for now)
- Write state to the pipeline register (IF/ID)
  - The next stage will read this pipeline register



#### Stage 1: Fetch Diagram





#### Stage 2: Decode

- Decodes opcode bits
  - Set up Control signals for later stages
- Read input operands from register file
  - Specified by decoded instruction bits
- Write state to the pipeline register (ID/EX)
  - Opcode
  - Register contents, immediate operand
  - PC+1 (even though decode didn't use it)
  - Control signals (from insn) for opcode and destReg



Stage 2: Decode Diagram





#### Stage 3: Execute

- Perform ALU operations
  - Calculate result of instruction
    - Control signals select operation
    - Contents of regA used as one input
    - Either regB or constant offset (imm from insn) used as second input
  - Calculate PC-relative branch target
    - PC+1+(constant offset)
- Write state to the pipeline register (EX/Mem)
  - ALU result, contents of regB, and PC+1+offset
  - Control signals (from insn) for opcode and destReg







#### Stage 4: Memory

- Perform data cache access
  - ALU result contains address for LD or ST
  - Opcode bits control R/W and enable signals
- Write state to the pipeline register (Mem/WB)
  - ALU result and Loaded data
  - Control signals (from insn) for opcode and destReg

### Stage 4: Memory Diagram





#### Stage 5: Write-back

- Writing result to register file (if required)
  - Write Loaded data to destReg for LD
  - Write ALU result to destReg for ALU insn
  - Opcode bits control register write enable signal



#### Stage 5: Write-back Diagram





### Putting It All Together





# Issues With Pipelining



### Pipelining Idealism

- Uniform Sub-operations
  - Operation can partitioned into uniform-latency sub-ops

- Repetition of Identical Operations
  - Same ops performed on many different inputs

- Independent Operations
  - All ops are mutually independent



#### Pipeline Realism

- Uniform Sub-operations ... NOT!
  - Balance pipeline stages
    - Stage quantization to yield balanced stages
    - Minimize internal fragmentation (left-over time near end of cycle)
- Repetition of Identical Operations ... NOT!
  - Unifying instruction types
    - Coalescing instruction types into one "multi-function" pipe
    - Minimize external fragmentation (idle stages to match length)
- Independent Operations ... NOT!
  - Resolve data and resource hazards
    - Inter-instruction dependency detection and resolution



#### The Generic Instruction Pipeline





Balancing Pipeline Stages



$$T_{IF} = 6$$
 units

$$T_{ID}$$
 = 2 units

$$T_{ID}$$
 = 9 units

$$T_{EX} = 5$$
 units

$$T_{OS} = 9$$
 units

#### Without pipelining

$$T_{cyc} \approx T_{IF} + T_{ID} + T_{OF} + T_{EX} + T_{OS}$$
$$= 3I$$

#### **Pipelined**

$$T_{cyc} \approx max\{T_{IP}, T_{ID}, T_{OP}, T_{EX}, T_{OS}\}$$
  
= 9

Speedup = 
$$31 / 9 = 3.44$$



## Balancing Pipeline Stages (1/2)

- Two methods for stage quantization
  - Divide sub-ops into smaller pieces
  - Merge multiple sub-ops into one
- Recent/Current trends
  - Deeper pipelines (more and more stages)
  - Pipelining of memory accesses
  - Multiple different pipelines/sub-pipelines



## Balancing Pipeline Stages (2/2)

Coarser-Grained Machine Cycle:
4 machine cyc / instruction

Finer-Grained Machine Cycle: 11 machine cyc /instruction





# stages = II $T_{cyc}$  = 3 units



#### Pipeline Examples







### Instruction Dependencies (1/2)

- Data Dependence
  - <u>Read-After-Write</u> (<u>RAW</u>) (the only true dependence)
    - Read must wait until earlier write finishes
  - Anti-Dependence (WAR)
    - Write must wait until earlier read finishes (avoid clobbering)
  - Output Dependence (WAW)
    - Earlier write can't overwrite later write
- Control Dependence (a.k.a. Procedural Dependence)
  - Branch condition must execute before branch target
  - Instructions after branch cannot run before branch



## Instruction Dependencies (1/2)

```
From Quicksort: # for (; (j < high) && (array[j] < array[low]); ++j);
```





### Hardware Dependency Analysis

- Processor must handle
  - Register Data Dependencies (same register)
    - RAW, WAW, WAR
  - Memory Data Dependencies (same address)
    - RAW, WAW, WAR
  - Control Dependencies



#### Pipeline Terminology

#### Pipeline Hazards

- Potential violations of program dependencies
  - Due to multiple in-flight instructions
- Must ensure program dependencies are not violated

#### Hazard Resolution

- Static method: compiler guarantees correctness
  - By inserting No-Ops or independent insns between dependent insns
- Dynamic method: hardware checks at runtime
  - Two basic techniques: Stall (costs perf.), Forward (costs hw)

#### Pipeline Interlock

- Hardware mechanism for dynamic hazard resolution
- Must detect and enforce dependencies at runtime



#### Pipeline: Steady State





#### Data Hazards

- Necessary conditions:
  - WAR: write stage earlier than read stage
    - Is this possible in IF-ID-RD-EX-MEM-WB?
  - WAW: write stage earlier than write stage
    - Is this possible in IF-ID-RD-EX-MEM-WB?
  - RAW: read stage earlier than write stage
    - Is this possible in IF-ID-RD-EX-MEM-WB?
- If conditions not met, no need to resolve
- Check for both register and memory



#### Pipeline: Data Hazard



- How to detect?
  - Compare read register specifiers for newer instructions with write register specifiers for older instructions



### Option 1: Stall on Data Hazard



- IF/ID pipeline latch not updated
- Send no-op down pipeline (called a bubble)



# Option 2: Forwarding Paths (1/3)



MEM——ALU

Requires stalling even with forwarding paths



# Option 2: Forwarding Paths (2/3)





Option 2: Forwarding Paths (3/3)





## Pipeline: Control Hazard





#### Option 1: Stall on Control Hazard



- Stop fetching until branch outcome is known
  - Send no-ops down the pipe
- Easy to implement
- Performs poorly
  - On out of 6 instructions are branches
  - Each branch takes 3 cycles
  - CPI =  $1 + 3 \times 1/6 = 1.5$  (lower bound)



#### Option 2: Prediction for Control Hazards



- Predict branch not taken
- Send sequential instructions down pipeline
- Must stop memory and RF writes
- Kill instructions later if incorrect
- Fetch from branch target



#### Option 3: Delay Slots for Control Hazards

- Another option: delayed branches
  - # of delay slots (ds): stages between IF and where the branch is resolved
    - 3 in our example
  - Always execute following ds instructions
  - Put useful instruction there, otherwise no-op
- Losing popularity
  - Just a stopgap (one cycle, one instruction)
  - Superscalar processors (later)
    - Delay slot just gets in the way (special case)



# Superscalar Pipelines

Instruction-Level Parallelism Beyond Simple Pipelines



# Going Beyond Scalar

- Scalar pipeline limited to CPI ≥ 1.0
  - Can never run more than 1 insn per cycle
- "Superscalar" can achieve CPI ≤ 1.0 (i.e., IPC ≥ 1.0)
  - <u>Superscalar</u> means executing multiple insns in parallel



#### Architectures for Instruction Parallelism

- Scalar pipeline (baseline)
  - Instruction/overlap parallelism = D
  - Operation Latency = 1
  - Peak IPC = 1.0







# Superscalar Machine

- Superscalar (pipelined) Execution
  - Instruction parallelism = D x N
  - Operation Latency = 1
  - Peak IPC = N per cycle





# Superscalar Example: Pentium





#### Pentium Hazards & Stalls

- "Pairing Rules" (when can't two insns exec?)
  - Read/flow dependence
    - mov eax, 8
    - mov [ebp], eax
  - Output dependence
    - mov eax, 8
    - mov eax, [ebp]
  - Partial register stalls
    - mov al, 1
    - mov ah, 0
  - Function unit rules
    - Some instructions can never be paired
      - MUL, DIV, PUSHA, MOVS, some FP



#### Limitations of In-Order Pipelines

- If the machine parallelism is increased
  - ... dependencies reduce performance
  - CPI of in-order pipelines degrades sharply
    - As N approaches avg. distance between dependent instructions
    - Forwarding is no longer effective
  - Must stall often



#### The In-Order N-Instruction Limit

- On average, parent-child separation is about 5 insn
  - (Franklin and Sohi '92)



Average of 5 means there are many cases when the separation is < 4... each of these limits parallelism